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Abstract 

Background: A small number of prognostic and predictive tests based on gene expression are currently offered as 
reference laboratory tests. In contrast to such success stories, a number of flaws and errors have recently been 
identified in other genomic-based predictors and the success rate for developing clinically useful genomic 
signatures is low. These errors have led to widespread concerns about the protocols for conducting and reporting 
of computational research. As a result, a need has emerged for a template for reproducible development of 
genomic signatures that incorporates full transparency, data sharing and statistical robustness. 

Results: Here we present the first fully reproducible analysis of the data used to train and test MammaPrint, an 
FDA-cleared prognostic test for breast cancer based on a 70-gene expression signature. We provide all the software 
and documentation necessary for researchers to build and evaluate genomic classifiers based on these data. As an 
example of the utility of this reproducible research resource, we develop a simple prognostic classifier that uses 
only 16 genes from the MammaPrint signature and is equally accurate in predicting 5-year disease free survival. 

Conclusions: Our study provides a prototypic example for reproducible development of computational algorithms 
for learning prognostic biomarkers in the era of personalized medicine. 

Keywords: Reproducible research, Gene expression analysis, Biomarkers, Top scoring pair, Prediction, Genomics, 
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Background 

Currently, a number of molecular-based prognostic and 
predictive tests for breast cancer are offered as labora- 
tory services for clinical use [1,2]. Such assays, which in- 
clude MammaPrint [3], OncotypeDx [4], PAM50 Breast 
Cancer Intrinsic Subtype Classifier [5], MapQuant Dx [6] 
and Theros Breast Cancer Index [7], are implemented 
by providing multiple gene expression measurements 
obtained from tissue samples to multivariate classi- 
fication algorithms. Currently, published evidence on 
clinical validity and utility for such assays as they are 
offered to the patients is only available for MammaPrint 
and OncotypeDx; for the remainder of these tests the 
evidence derives from analyses performed in academic 
settings [2]. 
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According to a recent report [8] from the Institute of 
Medicine (IOM), OncotypeDx was the most widely 
used among these breast cancer assays, with more than 
175,000 patients tested as of mid 2011, followed by 
MammaPrint, used for 14,000 patients. OncotypeDX 
combines the expression levels of 21 genes and was de- 
veloped to predict the risk of distant recurrence at 
10 years for women with lymph node negative, estrogen 
receptor (ER) positive breast cancer [4], MammaPrint 
utilizes 70 genes to report a good or bad prognosis for 
each patient, and was developed from microarray ex- 
periments to predict 5-year metastatic recurrence of 
breast cancer as a first event among ER positive and 
negative patients [9,10]. The MammaPrint algorithm is 
based on correlating the 70-gene expression profile of a 
patient with a stored cancer profile in order to deter- 
mine a risk score for the patient. 

A relative small fraction of published cancer prog- 
nostic markers have subsequently been introduced in 
clinical practice, despite the large number of available 
studies focusing on biomarkers development. A major 
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hurdle hindering the translation of this research into 
clinically useful assays has been identified in the lack of 
rigorous criteria to report and publish tumor prognostic 
marker studies [8]. This issue has been addressed by 
introducing the REMARK guidelines, a set of recom- 
mendations for tumor marker prognostic studies, which 
provides the necessary framework for reporting all rele- 
vant information about prognostic marker development 
(i.e. study design, specimen and patient characteris- 
tics, analytical and statistical methods) [11]. Another 
key issue in the development of cancer biomarkers is 
the need for detailed and complete disclosure of all 
data and software [8,12,13]. This need is not specific 
to the development of predictive signatures from 
high-throughput molecular data but extends to many 
other branches of computational medicine and biology 
[14,15]. Whereas the guidelines for transparency in 
genomic data sharing date back a decade to the adop- 
tion of the Minimal Information About Microarray Ex- 
periments (MIAMIE) standards [16], the recent scandal 
leading to the decision to cancel three clinical trials 
based on microarray-based gene expression screening 
tests has dramatically underscored the need for revised 
genomics research criteria [17] that extend and/or inte- 
grate the REMARK and MIAME guidelines. 

Maximizing the level of evidence on the spectrum of re- 
producibility requires complete, independent replication 
[18]. As measured by this criterion, neither of the two suc- 
cessful breast cancer assays, MammaPrint and 
OncotypeDX, provides a paradigmatic example of the way 
genomic predictors should be developed. In the case of 
OncotypeDX, the prediction algorithm is described in detail 
and can be reprogrammed, but the original datasets used 
for the implementation and validation [4] of the assay were 
never placed in the public domain. Conversely, in the case 
of MammaPrint, although the original discovery and valid- 
ation datasets [3,19] are available, the pre-processing proto- 
col and prediction algorithm are only partially described. 

Thus the entire development, including data and code, 
is not available for either MammaPrint nor OncotypeDX. 
However, in the case of MammaPrint it is possible to 
undertake a transparent re-analysis of the data using an al- 
ternative approach, since the raw microarray data are 
available. We therefore focus here our efforts on reprodu- 
cing the results of Mammaprint. We collect and organize 
the original MammaPrint discovery and validation data. 
We also coordinate the associated metadata for these ex- 
periments and develop reproducible documents for their 
analysis. We reproduce and implement the preprocessing 
described in the original manuscripts. These data repre- 
sent a resource that can be used by other investigators 
both to verify the original claims about the MammaPrint 
signature and to build alternative predictors. As an ex- 
ample of the utility of these data, we use the MammaPrint 



discovery and validation data to develop an alternative sig- 
nature and prognostic test for breast cancer, which is 
based on several two-gene comparisons [20,21]. This pro- 
vides a detailed, transparent and fully reproducible ex- 
ample of constructing a multi-gene classifier. 

Methods 

Data assembly and code 

We collected the data from the original experiments used to 
identify [9] and develop [10] the MammaPrint 70-gene 
prognostic signature as provided as additional files with the 
original manuscripts. We also collected from ArrayExpress 
[22] the dataset used to retrain this signature on the custom 
array currently used in the MammaPrint assay [3] as well as 
the independent validation cohort using the same array [19]. 
All of these datasets have been organized in an open re- 
source that can be used to develop and compare prognostic 
signatures for breast cancer (available at http:// 
luigimarchionni.org/breastTSP.html) and Bioconductor 
[23]. This resource also encompasses the R [24] code and 
libraries used to retrieve, pre-process, manipulate, annotate, 
and analyze these data. The code, fully annotated and exe- 
cutable, is provided in the Additional files 1 and 2. All the 
analyses performed in our study were based on de-identi- 
fied publically available data, and they were performed in 
compliance to the Helsinki declaration. The research did 
not involve any experiment on human subjects or animals 
and for this reason no ethical approval was necessary. 

An example of reproducible signature development 

In order to build our new classifier we selected the 78 
patients originally used in the 70-gene prognostic signa- 
ture discovery and limited our analysis to the 70 genes 
contained in the original signature. We made these deci- 
sions for two reasons: (a) to make our development 
process entirely analogous to the process for 
MammaPrint and (b) so that our signature can be calcu- 
lated on the basis of the data from any current 
MammaPrint assay. To this end it should also be noted 
that the MammaPrint microarray platform only includes 
the prognostic signature genes and a set of housekeeping 
genes used for normalization purposes. These latter 
genes are designed not to change across samples and 
were therefore not used to train our predictor. We 
adopted an extension of a rank-based approach to classi- 
fication called "top-scoring pairs" (TSP) for developing 
understandable and powerful genomic signatures. This 
approach is invariant to all data preprocessing and 
normalization steps that maintain the ordering within 
sample gene expression profiles. The TSP algorithm se- 
lects the pair of genes whose expression levels switch 
their ranking most consistently between the two prog- 
nostic groups (Figure 1). The original TSP algorithm 
[20] and extensions [25] have previously been 
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Figure 1 Top Scoring Pair. A Top Scoring Pair (TSP) is formed by a 
pair of measurements that consistently change ranking between 
samples from different prognostic groups. 



successfully applied to differentiate [26], predict treat- 
ment response in breast cancer [27] and acute myeloid 
leukemia [28], and grade prostate cancers [29]. 

Building the K-TSP classifier 

We recorded the relative ordering of each pair of 
genes in the 70-gene MammaPrint signature in each of 
the 78 training samples. In other words, for each pair 
of genes g and g', and for each sample j, we record 
whether the expression of g in sample j is larger than 
the expression of g' in sample j or vice-versa. The "sig- 
nature" for the TSP classifier is the pair of genes that 
most consistently changes its relative expression order- 
ing between the two groups of patients and the corre- 
sponding decision rule for a new profile is determined 
entirely by the ordering between these two genes: 
choose group one if the observed ordering was most 
often seen in group one and group two otherwise. 
Here, the two groups of patients are those that re- 
curred within 5 years (poor prognosis) and those that 
who did not recur (good prognosis). The K-TSP algo- 
rithm uses K pairs of genes. It proceeds by first identi- 
fying the TSP, removing these two genes from the 70- 
gene signature, then searching for the pair of genes 
among the 68 remaining that most often switch their 
ordering between groups, removing these from the list, 
and so forth. Individually, each pair of genes "votes" 
for one of the two groups based on the observed or- 
dering. For a fixed number K of pairs, the final prog- 
nostic score is the sum of the votes for the poor 
prognosis group among all K pairs. The higher the 
score, the more evidence there is for poor prognosis. 

Selecting the number of pairs 

For each possible number of pairs K we measured the 
accuracy of the prognostic score on the training set by 
calculating the area under the receiver operating charac- 
teristic curve (AUC) [30] determined by considering all 
possible score thresholds for declaring poor prognosis. 



Here we used re-substitution AUC for training, since the 
TSP approach is based on binary decisions and is not 
prone to overfitting. The AUC increased with K until 
reaching a peak and then declined as further pairs were 
added (Figure 2). We focused on values of K near the 
peak AUC, namely K = 6 to K = 10, and only considered 
score thresholds achieving 100% sensitivity. The number 
of gene pairs K was then chosen to maximize specificity, 
which is equivalent to choosing the maximum score 
threshold which achieves 100% sensitivity. This resulted 
in the 8-TSP classifier (Figure 3) with score threshold 
two. Such resubstitution estimates obtained from the 
training set of samples were used only for the model 
optimization and do not reflect its performance, which 
in turn was assessed on an independent cohort of pa- 
tients (see below). 

Validation of the 8-TSP signature in an independent 
patients cohort 

To evaluate the classifier on a new sample, the relative 
ordering of each of the K = 8 pairs of genes is deter- 
mined and the sample is assigned to the poor prognosis 
group if there are two or more votes for poor prognosis 
(Figure 3), using the same procedures previously defined 
in the training set of patients. The 8-TSP signature and 
the MammaPrint test were hence compared in terms of 
classification performance, using standard measures 
such as accuracy, sensitivity, specificity, and AUC, and 
in term of survival, by Kaplan-Meier and Cox regression 
analyses. 

Results and discussion 

We compared our prognostic test to the MammaPrint 
test based on a large independent validation cohort 
consisting of 307 patients from a European multi-center 



AUC versus Number of TSP 
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Figure 2 Resubstittution performance in the training set. 

Receiver Operator Characteristics (ROC) analysis was performed in 
the training set and the Area Under the Curve (AUC) was used to 
select the final number of TSPs. An 8-TSP classifier was chosen to 
maintain 100% training set sensitivity and maximize specificity. 
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Figure 3 8-TSP breast cancer prognosis signature. Each of the 8 gene pairs votes independently; patients with two or more votes are 
classified as poor prognosis. 



study [19]. In this independent validation cohort our test 
achieved 91% sensitivity, 47% specificity, and 69% over- 
all accuracy (Figures 4A and 4B, and Additional files 1 
and 2). Sensitivity refers to correctly classifying poor 
prognosis patients and specificity refers to correctly 
classifying good prognosis patients. For comparison, 
the MammaPrint prognostic test achieves 89% sensitivity, 
42% specificity, and 65% overall accuracy [19,31] in this 
same validation set. Such performance in predicting 
metastatic recurrence within 5 years was reflected in the 
AUC estimates: 0.69 (95% CI: 0.64 - 0.74) and 0.59 (95% 
CI: 0.55 - 0.62) for the 8-TSP and MammaPrint respect- 
ively. (Comparable results were obtained by PAM, a well- 
known classification method; see the Additional files 1 
and 2.) Finally, while in the prediction of a metastatic 
event within five years the 8-TSP classifier performed 
better than the MammaPrint test, this latter assay 
maintained a better performance at later time points as 
revealed in survival analyses. This finding probably indi- 
cates that the additional features of the 70-gene signature 
not used in the 8-TSP classifier might carry additional 
prognostic information beyond five years (see Additional 
files 1 and 2). 

We have therefore built a prognostic classifier based 
on the genes from the MammaPrint signature that is as 
accurate in predicting 5-year disease-free survival as the 
MammaPrint prognostic test based. Our classifier only 
requires the measurement of expression for 16 of the 70 
genes used in Mammaprint. Moreover, the new test is 
easy to interpret and is robust with respect to any 



preprocessing of the expression data that maintains the 
ordering among expression levels within sample profiles. 

Finally, all design decisions and choices of parameters 
were based entirely on the training set. There was no 
"data leakage": no test data was examined until all as- 
pects of classifier development were "locked up." These 
are considered critical steps in developing reproducible 
and accurate genomic signatures as defined by the IOM 
report [8]. The two key parameters are K, the number of 
pairs of genes in the signature, and the score threshold. 
We only considered values of K between 6 and 10 since 
these values maximized overall performance, and we 
only considered thresholds that obtained 100% sensitiv- 
ity. Under these design constraints, we selected the K = 8 
since this value maximized specificity at 100% sensitivity 
(Figure 2). Our final classifier labels a sample as poor 
prognosis if two or more among the 8 pairs votes for the 
poor prognosis group (Figure 3). 

Our 8-TSP signature can be viewed as the combin- 
ation of multiple coordinated biological processes. Of 
the 70 genes originally identified in the study by van't 
Veer and colleagues [10], 18 genes had expression values 
positively associated with good prognosis, while 52 were 
associated with metastatic recurrence. Four of the K = 8 
pairs combine genes positively correlated with good 
prognosis (RTN4RL1, LGP2, MS4A7, and GSTM3) with 
genes associated with bad prognosis (OXCT1, HRASLS, 
Contig40831_RC, and MELK). These pairs represent a 
coordinated change from good prognosis expression pat- 
terns to poor prognosis patterns across multiple gene 
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Figure 4 8-TSP classification results in the validation set. Panel 

A) The 8-TSP results from the first 150 patients in the validation set. 
Each column represents one of the 8 pairs (blue = good prognosis vote, 
red = bad prognosis vote) and each row is a patient. Patients with bad 
prognosis (top rows) have more votes for bad prognosis. Panel B) The 
8-TSP results from the last 157 patients in the validation set. 



pairs. The remaining pairs comprise only genes origin- 
ally associated with a poor prognosis (GPR180, DTL, 
IGFBP5, SERF1A, GNAZ, RFC4, CDCA7, and UCHL5), 
suggesting that it is the quantitative level of expression 
of these genes that is important for predicting prognosis. 

It is of note that each individual TSP involved in the 
final classification scheme can be viewed as a separate 
molecular switch between the two prognostic groups, 
possibly entailing also a mechanistic underpinning. To 
this end some of the pairs we have identified appear to 
have an additional underlying mechanistic biological re- 
lationship. For instance one of the gene pairs, DTL- 
RCF4, appears to be tightly associated with the regula- 
tion of the replication fork and the DNA damage re- 
sponse. DTL and RCF4 physically interact and modulate 
the activity of the proliferating cell nuclear antigen 
(PCNA) [32-34], which plays a central role in the coord- 
ination of these processes. Similarly, another pair, 
GPR180-GNAZ, code for proteins involved in G protein 
mediated cellular signaling. 



Conclusions 

Our goal was to provide a transparent example of the 
manner in which a genomics-based cancer predictor 
might be developed from training data and evaluated on 
independent test data with sufficient detail and docu- 
mentation to allow the full process to be replicated by 
other researchers. Due to the unavailability of the ori- 
ginal data, it was not possible carry out this process for 
OncotypeDX, which is presently the most used and vali- 
dated predictor of this kind. Consequently, we 
performed a re-analysis of MammaPrint data. To this 
end, we selected the same samples and end-point origin- 
ally used for the implementation of this assay, although 
we are aware that a stratified analysis across ER positive 
and negative patients would be much more appropriate. 
In order to illustrate the development process from end 
to end, including a transparent decision rule, we have in- 
troduced a more parsimonious classifier with sensitivity, 
specificity, and overall accuracy very similar to the 70- 
gene MammaPrint signature. 

Our analysis was performed in complete adherence to 
the principles of transparent and reproducible research 
[13,18], providing all data sources used, and the 
complete code and software necessary for data prepro- 
cessing, analysis and validation. To our knowledge, this 
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is one of the few, if not the first, development of a gen- 
omic signature adhering to these standards. 

Additional files 



Additional file 1: Fully reproducible vignette of the analysis. 

Additional file 2: The archive contains the following files: 
"bmc_article.bst": BMC series bibliography style; "localFiles/ 
contactAgendia": instructions to obtain the hybridization mapping 
information from Agendia; "objs/buyseEset.rda": ExpressionSet for 
the Buyse cohort; "objs/glasEset.rda": ExpressionSet for the Glas 
cohort; "Supplement.bib": Bibliography in BibTex format; 
"Supplement.Rnw": Rnoweb/Sweave file containing code and text 
used to create the "Supplement.tex" file; "Supplement.tex": LaTeX 
file resulting from running the Sweave with the "Supplement.Rnw" 
file; All source code, data, and software packages used in the 
analyses are also available for download online from: http:// 
luigimarchionni.org/breastTSP.html. 
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